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Genetic association study is an essential step to discover genetic 
factors that are associated with a complex trait of interest. In this 
paper we present a novel generalized quasi-likelihood score (GQLS) 
test that is suitable for a study with either a quantitative trait or 
a binary trait. We use a logistic regression model to link the pheno- 
typic value of the trait to the distribution of allelic frequencies. In our 
model, the allele frequencies are treated as a response and the trait is 
treated as a covariate that allows us to leave the distribution of the 
trait values unspecified. Simulation studies indicate that our method 
is generally more powerful in comparison with the family-based as- 
sociation test (FBAT) and controls the type I error at the desired 
levels. We apply our method to analyze data on Holstein cattle for 
an estimated breeding value phenotype, and to analyze data from 
the Collaborative Study of the Genetics of Alcoholism for alcohol de- 
pendence. The results show a good portion of significant SNPs and 
regions consistent with previous reports in the literature, and also 
reveal new significant SNPs and regions that are associated with the 
complex trait of interest. 

1. Introduction. Recent biological technology allows researchers to per- 
form genome-wide association studies using a dense panel of SNPs at an af- 
fordable cost. Association studies have been widely used to identify genome 
regions that are associated with a complex trait of interest. Current meth- 
ods in genetic association studies can be roughly categorized into two ap- 
proaches: (1) studies on samples of unrelated subjects; (2) studies on samples 
of related subjects, from nuclear families, extended families, or from iso- 
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lated/founder populations which often include inbred individuals that are 
related through multiple lines of descent. 

The classical population-based association test in a case-control study de- 
sign is the simplest approach where unrelated affected (cases) and unaffected 
(controls) individuals are typed. However, for a rare disease, it is difficult to 
recruit independent cases in the general population, and, more importantly, 
the naive analysis of data from a general population recruitment design may 
lead to false positive signals due to confounding effects caused by the pop- 
ulation structure. Many researchers [Ewans and Spielman (2003); Khoury 
and Yang (1998); Lander and Schork (1994)] have reported and discussed 
aspects of this problem. For example, the confounding effect of ethnicity is 
well known as the population stratification effect in the genetics literature. 
For an association test with a quantitative trait, a simple linear regression 
model is often used. As noted, the association tests of quantitative traits 
via population-based approaches are also subject to the same problem of 
confounding by the population stratifications. 

The family-based association study design using the family based asso- 
ciation test (FBAT) analysis method has become popular, as this strategy 
is robust to the population heterogeneity [Horvath, Xu and Laird (2001); 
Laird, Horvath and Xu (2000)]. In FBAT analysis, a statistic U is com- 
puted on the basis of the linear combinations of offsprings' genotype and 
phenotype expression functions. The mean and the variance of U under the 
null hypothesis of no association is calculated conditional on the parental 
genotype. Thus, FBAT methods typically require the typing of family mem- 
bers, such as parents or siblings (for inferring a missing parental genotype) 
of each affected subject to make use of such a subject in the test. This 
becomes a limitation of the method. For example, for a late onset disease, 
it is difficult and sometimes impossible to collect the information of the 
family members of an affected subject. On the other hand, FBAT typically 
requires heterozygous parents to compute the null distribution of the test 
statistic. Moreover, when dealing with a large pedigree, FBAT breaks down 
the pedigree to small nuclear families, such that the relationship among re- 
motely related individuals are ignored. Similarly, FBAT does not take into 
account for the relationship across related families in the analysis. For these 
reasons, a family-based approach is generally less powerful in comparison 
with population-based approaches [Risch and Teng (1998); Bourgain et al. 
(2003); Thornton and McPeek (2007)]. 

Slager and Schaid (2001) have proposed a method that was based on 
the Armitage trend test with the inclusion of a variance that accounts for 
the relationships among individuals from an outbred population. However, 
this method cannot handle large, complex, inbred pedigrees. A different 
approach, a pedigree disequilibrium test, proposed by Martin, Bass and Ka- 
plan (2001) can be employed to handle large pedigree association analysis. 
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A founder /isolated population-based study design has been suggested [Lan- 
der and Schork (1994); Wright, Carothers and Pirastu (1999)] for association 
mapping. This study design efficiently controls the confounding effect due 
to population structure and has been useful for complex trait mapping. Re- 
cently, Bourgain et al. (2003) proposed a case-control association test where 
subjects are sampled from a founder population with known genealogy. They 
adapted the idea of a population-based association test to test whether the 
allele frequencies of a specified allele are equal between the case group and 
control group, taking into account the correlations among subjects and the 
inbreeding configuration within subjects. This method can be used to an- 
alyze data from a large inbred pedigree and is also suitable for data from 
multiple pedigrees with careful control of ethnic homogeneity [Thornton 
and McPeek (2007)]. The test is based on a quasi-likelihood scoring (QLS) 
approach and has been shown to be more powerful than the traditional 
transmission/disequilibrium test (TDT) when samples are from homoge- 
neous populations. However, these approaches are limited to binary traits. 

Following the line of quasi-likelihood approach proposed by Bourgain et al. 
(2003) and Thornton and McPeek (2007) to handle the correlation structure 
among related subjects, we propose a generalized linear model framework 
to accommodate other types of traits. We use a logistic regression model 
to link the trait to the distribution of allelic frequencies. In our model, the 
observed trait of each individual is treated as a covariate. The proportion 
of a specified allele in the genotype is the response. In conventional models, 
the phenotypic trait is treated as the response and the distribution of the 
trait values needed to be specified. For example, the normality assumption 
is often required for a quantitative trait. In our method, the trait is treated 
as an explanatory variable, which allows us to leave the distribution unspec- 
ified. On the other hand, treating the allele frequencies of the marker as the 
response, we have the exact covariance structure for the responses with the 
provision of the pedigree structure or the documented genealogy. Under this 
innovative modeling, we derive the test statistic (Wg) and show that Wq 
asymptotically follows a xt-i distribution, where k is the number of alleles 
of the marker. Our proposed GQLS test generalizes the existing approaches 
in three aspects: (1) the GQLS method can establish associations between 
marker's allele frequencies and all types of traits; (2) it uses a general link 
function to connect the mean value of the allele frequency with the traits; 
(3) our GQLS method can be extended to solve the problem when a sample 
is collected from multiple subpopulations. In this article we focus on the lo- 
gistic link, but the extension of our test to other link functions, for example, 
the probit function, would be straightforward. 

This paper is motivated by the challenges of analyzing data on Holstein 
cattle in North America. The aim of this study is to identify SNPs or genome 
regions that are associated with the estimated breeding values (EBVs) of 
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a proven bull. The EBV of a bull predicts its genetic merit. For example, the 
milk yield EBV of a bull predicts the milk yield of its female descendants. 
Conducting an association study in this data set is challenging. First, dams 
are not typed, and sires are typed only if they appear as proven bulls in the 
data set. Thus, FEAT is not applicable to analyze this data set. Second, 
most of the bulls, sires and dams, are inbred. They are descendants from 
a single complex pedigree and the relationships among them are known but 
complicated. The conventional population-based association test does not 
account for this complex relationship among subjects. Ignoring the correla- 
tion structure among subjects would lead to an inflated positive result. This 
will be shown by simulation studies in the paper. Third, the case-control 
founder-population-based approach proposed by Bourgain et al. (2003) is 
limited to binary traits where most of the EBVs are quantitative. Thus, the 
challenge of analyzing this data set becomes a motivation for the develop- 
ment of our method. 

We perform simulation studies on collections of pedigrees of various sizes 
and on single complex pedigrees with different sizes to validate our method. 
We compare the empirical performance of our method with others. In appli- 
cation, we also apply our method to the Collaborative Study of the Genetics 
of Alcoholism (COCA) data provided by the Genetic Analysis Workshop 
(GAW) 14 [Edenberg et al. (2005); Bailey- Wilson et al. (2005)] to demon- 
strate the application in the binary trait and multiple small families study 
design. 

The paper is organized as follows. Section 2 presents the proposed gen- 
eralized quasi- likelihood association test. Section 3 presents the details of 
simulation studies to assess the validity and the power of the proposed test 
compared with other methods. In Section 4 applications to real data are 
provided to illustrate the practical application of the proposed method. Dis- 
cussions are provided in Section 5. 

2. Methods. 

2.1. Association test with a hiallelic marker. Suppose that in a genetic 
study we have a sample of n subjects that is from a single isolated/founder 
population or a single pedigree. Subjects may be arbitrarily related with 
a known relationship. It is assumed that the inbreeding configuration for each 
subject is also known. Let X = {Xi, . . . ,Xn)' with Xi being the phenotypic 
observation of the zth subject. The Xi can be binary with Xi = 1 or 
coding for "affected" or "unaffected," respectively, or can be continuous for 
a quantitative trait. Given a biallelic marker of interest, alleles are labeled 
by "0" and "1." Let Y = (Yi, . . . , y„)' with Yi = \y. (the number of allele 1 
in subject i) being the proportion of the allele 1 in the observed genotype of 
subject and Yi = 0, ^, or 1. Let /x = (/xi, . . . ,/x„)' = E(Y|X) that < fii < 1. 
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We propose a logistic regression model to link the expected allele frequency /x 
of the marker with the trait X. We let 



(2.1) 



^ii = E{Yi\Xi 



1 + f^Po+hX, ■ 

To test the association between the marker and the trait, we test 

Ho:pi = against iJ^ : /5i 7^ 0. 

Our model provides a natural constraint that < /ij < 1 for all i = 1. 
Under the null hypothesis, we have fii = fj, = ^^^33- for all i = 1, . . . ,n. The 
mean vector of Y no longer depends on Xi and becomes fi = E(Y) = /il, 
where 1 is an n-vector of I's. It can be shown that, under Hq, the covariance 
matrix of Y is given by Sq = ^A*(l ~ A')P) 



,n. 



(2.2) 



/I + 01 2(/>i2 
24>i2 1 + 02 



'2n 



20i„ \ 

202n 
l + 0n/ 



where 0i is the inbreeding coefficient of individual i and (pij is the kinship 
coefficient between individual i and individual j. See Appendix A in the sup- 
plementary material for the justification [Feng et al. (2011)]. The covariance 
matrix XIq will be invertible if /i 7^ 1 or 0, and p is invertible provided that 
the monozygous twins (twins that are genetically identical, as they origi- 
nate from a single fertilized egg) are merged and represented by one single 
individual. This can be done using the multiple outputation procedure [Foll- 
mann, Proschan and Leifer (2003)]. The quasi-likelihood score function is in 
the form of 



(2.3) 5(/3) = {S^,{P),S(sA(3)y = D'^-\Y 

where D is a n x 2 derivative matrix in the form of 



(2.4) 



d(3 



and S is the covariance matrix of Y. Under the null hypothesis, we have 
fj, = III and the covariance matrix S = XIq- The solution to the equation of 
the quasi-likelihood score function 5/3g(/3o,0) = gives an estimate of /j, as 



(2.5) 



(l'p-^1) 



and therefore gives the estimate of as /3o = log t-^^ under the null hy- 
pothesis. See Appendix B in the supplementary material for the derivation 
[Feng et al. (2011)]. 
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When /3i ^ 0, the marker is associated with the trait and the expected 
value of Yi given the Xj is given by equation (2.1). For a binary trait, the 
two-sample model of Bourgain et al. (2003) in the form of 



ill 



p + r, if z is affected, with < + r < 1, 
p, if i is unaffected, with < p < 1 



becomes a special case of our model that p = j^^^- and r = — 

j^^°fig ■ We propose a generalized quasi- likelihood scoring statistic to test the 
association between the marker and the trait. Under the null hypothesis that 
/3i=0, 



E[5;3,(/3o,/3i = 0)] = E 



0. 



As described by Cox and Hinkley (1974), the quasi-score statistic is given by 

(2.6) W = S^, (/3o, 0)' Yai^HS^, 0o, 0))S^, (/3o, 0), 

where /3o is the quasi-likelihood estimate of /3o and var(^^(S'^j (/3o, 0)) is the 
(2, 2)th entry of the inverse of the information matrix I(/3) that is computed 
under the null hypothesis that (3i = 0. As demonstrated by Heyde (1997), 
under the null hypothesis, W follows a distribution with 1 degree of 
freedom asymptotically. In our case, we obtain an explicit expression for our 
generalized quasi-likelihood scoring statistic in the form of 

^C^^[X'p-(Y-;il)r 

(2.7) X [XV'X - (X'p-'1)(1 Vl)"'(l VX)]-i 
x[XV'(Y-Al)], 

where (1 is given by equation (2.5). See Appendix B in the supplementary 
material for the derivation [Feng et al. (2011)]. Note that, in equation (2.7), 
we do not need /3o to compute the Wg statistic. Wq is expressed in a general 
form for both the quantitative and binary traits. When the trait is binary, the 
quasi-likelihood scoring statistic proposed by Bourgain et al. (2003) becomes 
a special case of our Wq that they are the same. Under the null hypothesis, 
Wg follows a Xi distribution asymptotically. 

Following the same line as in Bourgain et al. (2003), we generalize the Wg 
statistic to accommodate F independent families in an outbred population. 
Among n subjects, let be the number of subjects that are from family / 
and let Yj = (li/, . . . , Ynff)' be the vector of y's for subjects that are from 
family /, f = 1, . . . ,F. Then, we have n = ni + ■ ■ ■ + np. Let and pj 
be the covariance and correlation matrix of y's for those subjects that are 
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from the /th family. If all the individuals in the sample are outbred, the 
diagonal entries of matrix are equal to 1 for all f = 1, . . . ,F. The overall 
covariance matrix under the null hypothesis is a block diagonal matrix that 
consists of Hi, . . . j'Sp. We derive that explicit form for the quasi- likelihood 
estimate of /.t under the null hypothesis as 

(2.8) fi- 



\/=i / \/= 

where 1/ is the nj-vector of I's. We derive an explicit form that 

(2.9) = 

where 



/=1 

/=i \/=i / \/=i / 

and is the nj-vector of the traits of the individuals from the /th family. 

2.2. Association test with a multiallelic marker. Now, suppose the marker 
under investigation has k different alleles and there are n individuals being 
sampled from a single pedigree. Let Y = (Y'^^, . . . , Y^_j^)' be an n{k — 1)- 
vector with Yj = (Yji, . . . ,Yjn)' being an n-vector that Yji = ^ x (the num- 
ber of allele j in individual i). Similarly to the biallelic case, we let = 
E(Y|X) = (/^i,. ..,/x'fc_J' with = {nji,...,fijny and 

g/3oj+/3ij^i 

Each random vector 2 x {Yu, . . . ,lfc-i,i)' follows a multinomial (2, (/iij, . . . , 
A'fc-i.i)') distribution with < Hji < 1 and Yl^=i f^ji = 1 for alH = 1, . . . , n. 
Under the null hypothesis that the marker is not associated with the trait, 
all /3ij's are 0. Thus, we perform a simultaneous hypothesis test that 

-f^o : = • • • = /3i,fc-i =0 vs Ha : at least one /3ij j^O, j = 1, . . . ,k — 1. 

Here, we generalize the notation of vector (3 as in the biallelic case that 
(3 = {(3'o, p[y with (3o = (/3oi, . . . , /?o,fc-i)' and = (Ai, . . . , /3i,fc-i)'. Under 
the null hypothesis that f3i = 0, we have Hji = fij for all i and rewrite the 
mean vector = (/iil', . • • j /^fc-il')' where 1 is an n-vector of I's. Under 
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the null hypothesis, the covariance matrix of Y is given by S = F (g) p (the 
Kronecker product of matrices F and p) where F is a (fc — 1) x (A; — 1) matrix, 
which is the same as in Bourgain et al. (2003). Here, let /x* = (/xi, . . . 
be the {k — l)-vector such that fi = fx* 1 under the null hypothesis. We 
show that, under the null hypothesis, the quasi-likelihood estimate of /x* is 
given by 

(2.10) /i* = (Al, . . . , Afc-i)' = (1 Vl)"'(Ife-i ® (1 V))Y, 

where Ik-i is a (/c — 1) x (A; — 1) identity matrix. Thus, fj, = fi* 01. We obtain 
an explicit form of the generalized quasi-likelihood scoring statistic as 

(2.11) WG = C-iY- A)'(F-i (p-iXXV'))(Y - A), 

where C = [X'p-iX-X'p~il(l'p~il)"i(l'p~^X)]"i is a constant depend- 
ing on the trait vector X and the correlation matrix p, and F is computed 
by using the p*. See Appendix C in the supplementary material for deriva- 
tions of p* and Wg in the multiallelic case [Feng et al. (2011)]. Under the 
null hypothesis, Wq follows an distribution with k — 1 degrees of freedom 
asymptotically. Alternatively, we can express the statistic in the form 

fc-i fc-i 

(2.12) VFG = C^^(F-i),,(Y,-/i,l)'p-'XX'p-'(Y,-wl). 

j=i 1=1 

In the biallelic case that k = 2, we have F = ^/i(l — p) and I] = ^p(l — 
p)p, p* and Wg reduce to those that are derived under the biallelic case. 
When the n individuals in the sample comprise subjects that are from F 
independent families, we retain the notation of X^,l^ and pj as in the 
biallelic case. Let Yf = (Y'^j, . . . , Y^„^ j)' and Yjf = {Yji, . . . jYjnf)'- The 
statistic Wg is given by 

^G = C-^j;(F-i),, 

j=i 1=1 

(2.13) 



' F F '\ 

./=1 /=1 J 



where C = {^U ^W^S " ^W^f^^^^U 1>7' V)"'}"'- Un- 

der the null hypothesis, Wg follows an distribution asymptotically. 



2.3. Data collected from multiple subpopulations. In this paper we extend 
our GQLS method to a solution that overcomes the problem of population 
stratification. Suppose a sample is collected from S different subpopulations, 
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denoted by popi, . . . ,popg. For illustration, let the marker of interest be bi- 
allelic (e.g., an SNP). For each subpopulation, pop^, we compute a GQLS 

(s) (s) o 

test statistic, Wq . We know that the W)^ follows Xi distribution asymp- 
totically. In statistical theory, the sum of «5 independent random variables 
follows an distribution with the degrees of freedom being the sum of the S 
degrees of freedom. Thus, a new overall statistic, which is the sum over all 
subpopulations, having the form as 

WaU = Wg^ +Wg^ + --- + W^^'> 

follows an Xs distribution asymptotically under the null hypothesis. 

It is well known that FBAT is robust to the analysis of family data col- 
lected from different populations. We will compare the performance of our 
overall test method with FBAT in the population stratification problem via 
simulation studies. We will also apply this overall test method to the COGA 
data set. See Sections 3.3 and 4.2 for details. 

3. Simulation study. We conduct simulation studies to validate the 
distribution approximation to the distribution of the Wq statistic and to 
compare the power achieved by our approach with the power achieved by 
the FBAT. We consider three different study designs. First, we simulate 
single large complex pedigrees. Second, we simulate multiple small families. 
Third, for each study design, we combine samples simulated under settings 
to mimic a sample collected from different subpopulations to investigate the 
robustness of our extended method using the Waii statistic. Since SNPs are 
popular for genetic association studies and SNPs are typically biallelic, we 
simulate biallelic markers for demonstration. We use the software Kinlnbcoef 
[Bourgain (2003)] to compute the kinship-inbreeding coefficient correlation 
matrix p. We will describe the simulation procedures and summarize the 
results for each design in the following three subsections. 

3.1. Single large pedigree study design. In this study design a family is 
grown starting from a single individual. Each single individual is assigned 
a spouse with probability 0.8 or remains single with probability 0.2. For 
each couple, we generate the number of offspring according to a Poisson 
distribution with mean 3. Any pedigree that stops growing before the com- 
pletion of six generations by natural degeneration, or stops before reaching 
to a desired family size, is disregarded. A new pedigree is grown until we ob- 
tain one single pedigree that consists of six generations and has a desirable 
number of family members in the last three generations. In our simulation 
study, we generate three large single outbred pedigrees that have sizes of 
136, 273, and 557, respectively. Family members of the top three genera- 
tions are removed to mimic the practical situations (especially in human 
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data) in which cHnical information and DNA samples are most hkely not 
available for more than three generations back. The genealogy of the en- 
tire pedigree remains for calculating the correlation matrix p. Removing the 
family members from the top three generations, the pedigree sizes reduce to 
124, 251, and 526, respectively. For each founder (an individual with par- 
ents' genetic information unknown), the marker genotype is simulated by 
random mating. The genotypes of descents are generated according to the 
Mendelian law of segregation. 

To assess the type I error rate, for each individual, traits are generated ge- 
netically according to an SNP with the minor allele frequency (MAF) of the 
SNP being set to 0.3. Denote the genotype of the SNP by G that G = 0, 1, or 
2 for having 0, 1, or 2 allele 1 in the genotype. We simulated the quantitative 
trait, X, from N{—1 + G,a'^) with a = 1.2. The binary trait was simulated 
from Bernoulli(pG) with pQ = 0.1, pi = 0.3, p2 = 0.4. Then, an SNP that is 
unlinked to the causal SNP is generated. The minor allele frequency of the 
SNP is set to 0.3 and 0.1. For each combination of settings, we generate 1,000 
replicates. For each simulated data set, we compute the Wg statistic for the 
unlinked SNP, and take the rejection threshold to be the (1 — a)th quantile 
of the Xi distribution. We run FBAT on each simulated data set. In FBAT, 
default options are chosen in most of the cases except that the "minsize" 
(the minimum number of informative families) is set to 4. To illustrate the 
preservation of the type I error by considering the correlation among related 
subjects, we perform the standard Armitage trend test [Armitage (1955)] 
that assumes independent subjects in the sample. The Armitage trend test 
was implemented using the "independence_test" function in the R package 
"coin" [R Development Core Team (2009)]. This function also allows test- 
ing on the quantitative trait. We consider a = 0.05 and 0.01. In Table 1 we 
summarize the empirical rejection rates at each significance level for each 
combination of settings. The simulation results indicate that the Xi distri- 
bution approximates the distribution of the Wq statistic well. The inflation 
of the null empirical rejection rate using the trend test is obvious (indicated 
in bolded numbers) in the single large pedigree study. 

To compare the power with the FBAT method, we simulate the quanti- 
tative trait and the binary trait conditioning on the genotype of each in- 
dividual. The minor allele frequency of the association marker is set to 0.3 
and 0.1. Three different genetic models are considered for both the quanti- 
tative and binary trait. The quantitative trait X is generated according to 
an additive model: Xi = a + bOi + Ei, where 




-1 



for the homozygous genotype that Yi = 0, 
for the heterozygous genotype that 1^ = 1/2 
for the homozygous genotype that Yi = 1. 
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Table 1 

Type I error assessment — single large pedigree study design (6 generations) 



Sample size 



MAF 




Trait 




124 






251 






526 




GQLS 


FBAT 


Trend 


GQLS 


FBAT 


Trend 


GQLS 


FBAT 


Trend 


0.3 


0.05 


bt^ 


0.049 


0.051 


0.086 


0.045 


0.049 


0.103 


0.049 


0.051 


0.069 








0.048 


0.046 


0.152 


0.057 


0.048 


0.11 


0.053 


0.051 


0.107 




0.01 


bt 


0.006 


0.007 


0.019 


0.009 


0.009 


0.033 


0.011 


0.011 


0.015 






qt 


0.015 


0.011 


0.069 


0.009 


0.008 


0.03 


0.013 


0.01 


0.034 


0.1 


0.05 


bt 


0.052 


0.076 


0.048 


0.047 


0.038 


0.053 


0.048 


0.048 


0.065 






qt 


0.045 


0.057 


0.086 


0.054 


0.051 


0.059 


0.054 


0.044 


0.062 




0.01 


bt 


0.013 


0.004 


0.009 


0.009 


0.006 


0.007 


0.013 


0.012 


0.018 






qt 


0.014 


0.008 


0.016 


0.012 


0.012 


0.017 


0.011 


0.009 


0.008 



^Monte Carlo standard deviation — 0.0069 or 0.0031 for a — 0.05 or 0.01, respectively, ^bt: 
binary trait, ^qt: quantitative trait. 



The random environmental errors ej, are generated from iV(0,(T^). Without 
loss of generality, we set the intercept a = 0. We specify three different asso- 
ciation models: (1) 6 = 0.5, a = 1.2; (2) 6 = 1,C7 = 1.5; and (3) 6= 1,ct = 1.2. 
The coefficient b quantifies the effect of the marker. The different values 
of (7^ pose different levels of difficulty for the detection of genetic associa- 
tion. These three models are denoted by qtl, qt2, and qt3, respectively, in 
the tables that summarize the results of power assessments. 

For the binary trait, we generate the affection status of individuals ac- 
cording to three disease models. In model 1 we consider a recessive epistasis 
disease controlled by two SNPs that are unlinked to each other. Individuals 
having two copies of allele 1 at both SNPs have a penetrance [defined as 
/ = P(affected|genotype)] of /i = 0.5. Individuals having two copies of al- 
lele 1 at one SNP but not at the other SNP have a penetrance of /2 = 0.4. 
Individuals with fewer than two copies of allele 1 at both SNPs have a pen- 
etrance of /a = 0.1. In model 2 we consider a dominant epistasis disease 
controlled by two SNPs that are unlinked to each other. Individuals with 
at least one copy of allele 1 at both SNPs have a penetrance of /i = 0.5. 
All other individuals have a penetrance of /2 = 0.1. In model 3 we consider 
a single disease locus model with /i = 0.5 if an individual has two allele 
I's at the SNP, /2 = 0.3 if an individual has one allele 1 at the SNP, and 
/s = 0.1 otherwise. These three models are denoted by btl, bt2, and bt3, 
respectively, in the tables that summarize the results of power assessments. 

For each combination of settings, we generate 1,000 replicates. For each 
simulated data set, we compute the Wq and obtain the p-value by the Xi 
approximation. We run FBAT on each simulated data set. The proportions 
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Table 2 

Power comparison — single large family study design (6 generations) 



Sample size 

124 251 526 



MAF 


Trait 


a 


GQLS 


FBAT 


GQLS 


FBAT 


GQLS 


FBAT 


0.3 


btl 


0.05 


0.228 


0.186 


0.368 


0.277 


0.603 


0.466 






U.Ui 


U.lUo 


u.ud 


U.loD 


U.IUD 


U.OoD 


U.zo / 






u.uo 




U.ZD4 


U.DDO 


U.401 


u.yzo 


U. / D4 






0.01 


0.218 


0.089 


0.444 


0.237 


0.791 


0.548 




Dto 


U.Uo 


U.DlU 


U.4Z1 


n oi 
u.yi 


U. / Uo 


n 007 
u.yy ( 


u.yo 






0.01 


0.385 


0.181 


0.758 


0.439 


0.868 


0.851 


0.1 


btl 


0.05 


0.098 


0.063 


0.103 


0.076 


0.135 


0.099 








u.uo 


n m 

u.u± 


u.uo 


U.U -LU 


u.uuo 


n (197 




bt2 


0.05 


0.164 


0.116 


0.211 


0.132 


0.294 


0.187 






0.01 


0.063 


0.029 


0.088 


0.036 


0.145 


0.77 




bt3 


0.05 


0.452 


0.33 


0.624 


0.424 


0.911 


0.721 






0.01 


0.263 


0.119 


0.442 


0.167 


0.804 


0.463 


0.3 


qtl 


0.05 


0.468 


0.416 


0.781 


0.734 


0.977 


0.96 






0.01 


0.242 


0.19 


0.572 


0.491 


0.905 


0.869 




qt2 


0.05 


0.831 


0.791 


0.991 


0.970 


1 


1 






0.01 


0.656 


0.558 


0.946 


0.897 


0.999 


0.999 




qt3 


0.05 


0.943 


0.909 


0.999 


0.994 


1 


1 






0.01 


0.836 


0.758 


0.995 


0.981 


1 


1 


0.1 


qtl 


0.05 


0.261 


0.228 


0.401 


0.364 


0.707 


0.650 






0.01 


0.126 


0.074 


0.214 


0.173 


0.483 


0.403 




qt2 


0.05 


0.511 


0.451 


0.730 


0.674 


0.968 


0.944 






0.01 


0.326 


0.23 


0.54 


0.428 


0.916 


0.842 




qt3 


0.05 


0.658 


0.602 


0.868 


0.82 


0.994 


0.979 






0.01 


0.469 


0.349 


0.736 


0.642 


0.978 


0.922 


of p- 


values < a 


are 


reported in 


Table 2. 


Simulat 


ion results show that our 



method outperforms the FBAT for a higher detection power in all scenarios. 
Results are particularly striking for the binary trait with small sample size. 

We extend our simulation studies to a single pedigree that consists of 
nine generations. Genotypes and clinical information of family members in 
the top six generations are removed. The genealogy of the entire pedigree 
remains for calculating the correlation matrix p. We generate two single 
large pedigrees having sizes of 704 and 875, respectively. After removing the 
family members in the top six generations, there are 615 and 795 individ- 
uals remaining. Similarly, we set the MAF of 0.3 and 0.1. The results of 
type I error and power assessments are summarized in Tables 1 and 2 in 
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Table 3 

Type I error assessment — multiple families study design 



Sample size 



MAF 


q: 


Trait 




100 






200 






500 




GQLS 


FBAT 


Trend 


GQLS 


FBAT 


Trend 


GQLS 


FBAT 


Trend 


0.3 


0.05 


bt 


0.055 


0.037 


0.1 


0.048 


0.053 


0.09 


0.056 


0.051 


0.057 






qt 


0.056 


0.058 


0.0654 


0.052 


0.049 


0.07 


0.055 


0.054 


0.064 




0.01 


bt 


0.012 


0.005 


0.025 


0.012 


0.010 


0.018 


0.012 


0.013 


0.012 






qt 


0.013 


0.010 


0.009 


0.013 


0.008 


0.020 


0.011 


0.011 


0.015 


0.1 


0.05 


bt 


0.054 


0.037 


0.059 


0.050 


0.045 


0.0068 


0.05 


0.05 


0.065 






qt 


0.048 


0.043 


0.082 


0.047 


0.048 


0.088 


0.043 


0.055 


0.070 




0.01 


bt 


0.015 


0.006 


0.011 


0.01 


0.009 


0.013 


0.007 


0.006 


0.013 






qt 


0.013 


0.007 


0.022 


0.007 


0.006 


0.031 


0.007 


0.006 


0.013 



the supplementary material [Feng et al. (2011)]. The simulation results are 
consistent to the results of the studies with six generations. The empirical 
type I error rates obtained by our method and the FBAT are close to each of 
the nominal significance levels. The trend test generally inflates the empiri- 
cal rejection rate under the null hypothesis (indicated in bolded numbers). 
Our method is generally more powerful than the FBAT. 

3.2. Multiple families study design. In this study families are grown fol- 
lowing the similar procedure as for the single large family study design except 
that families will grow for a maximum of three generations. The simulated 
sample comprises families and independent individuals. Family sizes range 
from 1 to 23 with an average size of 6.3. As in the single large pedigree 
study design, the genotype of founders is generated by random mating and 
the genotype of nonfounders is generated according to the Mendelian law of 
segregation. We let the sample size (number of subjects) be 100, 200, and 
500, respectively. To assess the type I error rate, we generate a quantitative 
trait and a binary trait for each individual as described in the single large 
family study design. Then, an SNP that is unlinked to the causal SNP is 
generated. The minor allele frequency of the SNPs is set to 0.3 and 0.1. 
For each combination of settings, we generate 1,000 replicates. In Table 3 
we summarize the null empirical rejection rates. The results indicate that 
the Xi distribution approximates the distribution of the Wq statistic well. 
The inflation of the null empirical rejection rate using the trend test is 
observed. For power comparisons, we simulate the quantitative traits and 
binary traits according to the six models that have been described in the 
previous section. The MAF of the association marker is also set to 0.3 and 
0.1. The powers achieved by our method and the FBAT under each combi- 



14 FENG, WONG, GAO AND SCHENKEL 

Table 4 

Power comparison — multiple small pedigree study design 



Sample size 

100 200 500 



MAF 


Trait 


q; 


GQLS 


FBAT 


GQLS 


FBAT 


GQLS 


FBAT 


0.3 


btl 


0.05 


0.22 


0.17 


0, 


.302 


0.171 


0.610 


0.388 






U.Ui 


u.uyo 




n 

u. 


1 'i^ 


u.u * 


0/109 


91 Pi 






U.UtJ 




n 1 78 


0, 


.561 


0^17 


Q'^ 








0.01 


0.170 


0.044 


u. 


.ooy 


0.11 


0.818 


0.42 




IJIO 


u.uo 


u.ooy 


U.OZO 


0, 


.829 


Pil /I 




Q1 7 






0.01 


0.38 


0.13 


u. 


.D40 


0.255 


0.982 


0.78 


0.1 


btl 


0.05 


0.095 


0.063 


0, 


.095 


0.082 


0.118 


0.073 






0.01 


0.029 


0.008 


0, 


.041 


0.01 


0.032 


0.011 




bt2 


0.05 


0.132 


0.078 


0, 


.183 


0.107 


0.322 


0.16 






0.01 


0.046 


0.011 


0, 


.078 


0.024 


0.148 


0.06 




bt3 


0.05 


0.118 


0.073 


0, 


.322 


0.16 


0.942 


0.634 






0.01 


0.032 


0.011 


0, 


.148 


0.06 


0.843 


0.361 


0.3 


qtl 


0.05 


0.433 


0.309 


0, 


.709 


0.546 


0.98 


0.928 






0.01 


0.217 


0.114 


0, 


.478 


0.302 


0.921 


0.777 




qt2 


0.05 


0.795 


0.624 


0, 


.969 


0.849 


1 


1 






0.01 


0.58 


0.369 


0, 


.913 


0.704 


1 


1 




qt3 


0.05 


0.934 


0.792 


0, 


.999 


0.976 


1 


1 






0.01 


0.817 


0.552 


0, 


.99 


0.691 


1 


1 


0.1 


qtl 


0.05 


0.215 


0.153 


0, 


.351 


0.264 


0.746 


0.597 






0.01 


0.079 


0.046 


0, 


.157 


0.091 


0.520 


0.331 




qt2 


0.05 


0.456 


0.279 


0, 


.734 


0.502 


0.986 


0.726 






0.01 


0.228 


0.09 


0, 


.515 


0.251 


0.943 


0.796 




qt3 


0.05 


0.647 


0.391 


0, 


.895 


0.625 


0.99 


0.977 






0.01 


0.419 


0.149 


0, 


.745 


0.4 


0.992 


0.916 



nation of settings are summarized in Table 4. Simulation results show that 
our method consistently outperforms FBAT for all scenarios. 

3.3. Data with subpopulations. In this section we consider the situation 
that a sample contains individuals from different populations. Similarly to 
the previous section, we consider biallelic markers. For illustration, we con- 
sider a sample collected from two subpopulations only. In fact, for each of 
the previous study designs, the single large pedigree and the multiple small 
pedigrees, we combine two simulated data sets with different MAF to make 
up a sample that consists of individuals from two different populations. For 
example, in the single large pedigree study design, we combined the two sim- 
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ulated samples from two subpopulations with MAF being set to 0.1 and 0.3, 
and with different combinations of sample sizes for each subpopulation. For 
each combined sample, the Waii is the sum of the two Wq statistics from two 
subsamples. The p-values are obtained by the distribution. The type I 
error rate and the power are compared between our method and FBAT. 

In the supplementary material, Table 3, we summarize the results of type I 
error rates assessment by combining two single large pedigrees [Feng et al. 
(2011)]. Similarly, in the supplementary material, Table 4, we summarize the 
results of type I error assessment by combining the two simulated samples 
of multiple small pedigrees [Feng et al. (2011)]. Overall, the empirical type I 
error rates obtained by our method using the Waii test statistics and the 
empirical type I error rates obtained by FBAT are close to each of the nomi- 
nal significance levels. However, FBAT is slightly less stable. For example, in 
Table 3, the empirical type I error rate is 0.005 at 0.01 significance level for 
a quantitative trait when combining the sample size of 124 from population 1 
and sample size of 526 from the population 2. In Table 4, the empirical error 
rate is 0.033 at 0.05 significance level for a binary trait when combining the 
sample sizes of 100 from both population 1 and population 2. Both of the 
95% confidence intervals constructed based on these two empirical type I 
error rates do not cover the true values of a = 0.01 and 0.05. 

In the supplementary material. Tables 5 and 6, we summarize the results 
of power assessment [Feng et al. (2011)]. The simulation results indicated 
that the performance of our method and FBAT are comparable that one 
shows some advantages over the other under some suituations, and vice 
versa. 

4. Real data analysis. 

4.1. Application to Holstein cattle data. The data set contains 821 proge- 
ny-tested proven bulls born between 1965 and 2001. Each bull was genotyped 
using the Affymetrix MegAllele GeneChip Bovine mapping lOK SNP array 
[Affymetrix Inc. (2005)]. Among 821 bulls, some bulls also appear as the 
sires of other bulls. The relationships among bulls and their sires and dams 
are complicated. All of the 821 bulls sampled have genetically contributed 
to the current Canadian cow population. Most of the animals in the popu- 
lation have a nonzero inbreeding coefficient. A genealogy of the population 
tracing back 25 generations, with the oldest animal born in 1909, was used 
to compute the kinship-inbreeding coefficient with the software CFC [Sar- 
golzaei, Iwaisaki and Colleau (2006)]. Out of 9,919 genotyped SNPs, only 
8,624 SNPs have known location on the 29 Bos Taurus autosome chromo- 
somes (BTA). SNPs with more than 20% of missing values or MAF of less 
than 5% were excluded from the study. A total of 7,103 SNPs were ana- 
lyzed. The experimental design is mainly a granddaughter design that the 
milk productivities of daughters and granddaughters of a bull are used to 
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estimate the breeding value of the bull. The phenotypes used in the analysis 
were trait EBVs released in November 2008 and provided by the Canadian 
Dairy Network (CDN, Guelph, Canada). For illustration, we only present 
results of the association tests with milk yield EBV. 

In Table 5 we report the top 81 most significant SNPs that have p-va- 
lue < 0.001 that can be grouped into 36 regions (SNPs at a close inter- 
distance, less than IcM, define a region) on 16 BTAs. Out of 36 significant 
SNPs or regions, 16 significant SNPs or regions have been found in agree- 
ment with the quantitative traits loci or associated SNPs reported in the 
literature. In BTA14, 22 SNPs concentrated in 0-27cM have strong associ- 
ation with milk yield and their p- values range from 6.45 x 10"^'^ to 0.001. 
At the telomere of BTA14, Daicylglycerol acyl transferase 1 (DGATl) at 
OcM has been considered to be a quantitative trait nucleotide with a major 
effect on milk yield [Bennewitz et al. (2003); Boichard et al. (2003); Grisart 
et al. (2004)]. An SNP at 0.27cM has a strong association signal. Twelve 
SNPs in the region of 3.38-8.47cM are consistent with 3 SNPs at 4cM, 5cM, 
and 6cM that have been reported significantly associated with milk yield by 
Daetwyler et al. (2007) and Bennewitz et al. (2003). An SNP at 11.2cM also 
confirms the association with milk yield reported by Daetwyler et al. (2007). 
The most significant SNP is found at 94cM on BTAS and confirms a QTL 
at the same location reported by Viitala et al. (2003). A significant SNP at 
98cM also confirms a QTL at the same location reported by Viitala et al. 
(2003). Note that, after adjusting for Bonferroni's correction at 5% signifi- 
cance level (or at 7.13x10"^ individual significance level), 11 regions remain 
significant. However, for many complex traits that are controlled by several 
genes, each individual gene may only have a small effect. When thousands of 
SNPs are tested, using the Bonferroni's correction may result in low power 
of the study. Therefore, when we interpret the Bonferroni result, we need to 
be careful that some signals disappearing after the adjustment may be due 
to the conservativeness of Bonferroni's correction. 

4.2. Application to COGA data. The Collaborative Study on the Ge- 
netics of Alcoholism (COGA) data set was provided by the Genetic Anal- 
ysis Workshop 14 (GAW14). The data set included 1,614 individuals from 
143 families. Among 1,614 individuals, 1,351 individuals were genotyped for 
a panel of 11,555 SNPs from Affymetrix. A set of alcoholism phenotypes and 
covariates were provided. We use the ALDXl as the phenotype. Individuals 
who are coded as "affected" in the ALDXl variable are considered as affected 
individuals. Unaffected individuals are those coded as "pure" unaffected in 
the ALDXl. Individuals with other codings are considered to have unknown 
phenotypes. In this study, we compare our method with FBAT under three 
scenarios. In scenario 1 we consider a large sample from a single population. 
We only include individuals who are coded as "white, non-Hispanic." There 
are 119 such families consisting of 1,074 individuals. In scenario 2 we con- 
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Table 5 






Adost signtp/Cant loci 


(p -value ^ 0.001) found fov 


TTiilk yield tvait 




INo. 01 bINFs 


Location (cM) 


p- value 


1 


1 


47.90^ 


2.18x10"^ 


4 


1 


20.05^ 


0.000105 




2 


56.65, 59.81^ 


0.000664 




1 


101.74 


0.000126 


5 


1 


1.03^ 


0.00086 




1 


8.32 


2.91 X 10"® 




3 


29.59-34.46 


7.5 X 10"® 




6 


45.51-50.53 


4.65 X 10"'^* 




1 


69.89 


8.8 X 10-*^ 




8 


73.49-77.77 


8.44 X 10"^* 




12 


90.76-101.06*^'^'^ 


3.14 X 10""* 




1 


114.90^ 


0.000125 


6 


1 


47.66'^ 


0.000355 


7 


1 


75.07^^ 


0.001 


8 


1 


41.75 


0.000215 




1 


55 


0.000126 


11 


1 


113.46 


0.000853 


12 


1 


61.77^ 


5.81 X 10"^* 


14 


1 


Q 273.4.5.6 


2.06 X 10"^* 




12 


3.38-8.47^'^ 


3.86 X 10"** 




3 


11.2*'^ 


4.94 X 10"*^* 




4 


21.50* 


6.45 X 10"^°* 




2 


26.69 


0.000691 


15 


1 


21 


0.000291 


16 


1 


31.66 


3 3Q X 1 fl"* 




1 


54 


0.000364 




1 


62 


0.000946 




1 


90.54* 


3.85 X 10"® 


17 


1 


16 


0.000595 




1 


72 


0.00034 




1 


78.58 


0.000868 


18 


1 


15.78* 


1.75 X 10"^* 


23 


2 


9.36 


4.44x10"*^* 


26 


2 


44, 45 


0.000341 




1 


53^'* 


0.00061 


27 


1 


57 


0.000962 


^Chromosomal region that the SNPs span on. ^Minimum p- 


-value if there is more that 



one SNP in the region. '^In agreement with Bennewitz et al. (2003). *In agreement with 
Boichard et al. (2003). ®In agreement with Daetwyler et al. (2007). ®In agreement with 
Grisart et al. (2004). ^In agreement with Heyen et al. (1999). *In agreement with Viitala 
et al. (2003). ^In agreement with Viitala (2008). Significant at 5% Bonferroni's correction 
(at 7.13x10^* individual significance level). 
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sider a small sample from a single population. We only include individuals 
who are coded as "white, Hispanic." There are 11 such families consisting of 
78 individuals. In scenario 3 we combine the two samples from the the two 
populations of "white, Hispanic" and "white, non-Hispanic." In our studies, 
we use the software Kinlnbcoef to compute the kinship coefficient for cor- 
relation matrix p. We only analyze SNPs that are on autosomes. In total, 
there are 10,532 SNPs on autosomes. 

The results based on our method are summarized in the supplementary 
material, Table 7 [Feng et al. (2011)]. In total, there are 22 SNPs found to 
be significant (p-values < 0.001) in the "white, Hispanic" sample, 19 SNPs 
are found to be significant based on the "white, non-Hispanic" sample, 
and 24 SNPs are found to be significant based on the pooled samples of 
"white." There are 19 SNPs that are significant in both the pooled sample 
and the "white, Hispanic" or in both the pooled sample and the "white, 
non-Hispanic" sample. On chromosome 2, SNP tsc0052826 is significant in 
both the "white Hispanic" sample and the pooled sample; it is 0.344cM 
from a marker that had been reported for a significant linkage with alco- 
hol dependence [Hih et al. (2004); Valdes, McWeeney and Thomson (1999)]. 
On chromosome 6, SNP tscl395926 is significant in both the "white His- 
panic" sample and the pooled sample. It is very close to two loci (less than 
1Mb) that had been found to link to the alcoholism [Hill et al. (2004); Ma 
et al. (2005)]. On Chromosome 7, SNPs tsc0333356 is significant in both 
the "white Hispanic" sample and the pooled sample; it is 1.47cM away from 
a marker that had been reported to significantly link to ALDXl by Zhu et al. 
(2005) and is O.SllcM from a marker that has shown significant linkage to 
alcohol dependence by Hill et al. (2004). The most significant SNP is SNP 
tsc0059716 on chromosome 13 (p- value = 4x 10~^), which is about 2.4cM 
away from an SNP that had been reported to significantly associate with 
ALDXl [Zhu et al. (2005)]. In total, there are 12 SNPs found to be very close 
to regions or SNPs that had been reported to link or associate with alcohol 
dependence or alcoholism related traits in the literature. After adjusting for 
Bonferroni's correction at 5% significance level (or at 4.75x10"^ individual 
significance level), four SNPs (tsc0587314 on chromosome 3, tsc0506913 on 
chromosome 5, tsc0630829 on chromosome 7, and tsc0059716 on chromo- 
some 13) remain significant. 

The results based on FBAT are summarized in Table 8 in the supplemen- 
tary material [Feng et al. (2011)]. In total, there are 43 SNPs found to be 
significant (p-value < 0.001) in the pooled sample, 29 SNPs are significant 
in the "white, non-Hispanic" sample, and only one SNP is significant in the 
"white, Hispanic" sample. Among these significant SNPs, SNP tsc0056748 
on chromosome 13 is significant in more that one sample (the pooled sam- 
ple and the "white, non-Hispanic" sample). There are 17 significant SNPs 
in the pooled sample that had been reported significantly associated with 
the ALDXl by Zhu et al. (2005). Note that the results in Zhu et al. (2005) 
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are based on the same pooled sample of "white, Hispanic" and "white, non- 
Hispanic" same definition of "affected" individual, and are analyzed by the 
FBAT as well. The only difference is the definition of "unaffected" individual, 
in that we only use "pure-affected" individuals while Zhu et al. (2005) use 
"pure-unaffected" and "never drank." Therefore, there would be more sig- 
nificant SNPs confirmed by Zhu et al. (2005). In addition, SNP tsc0046578 
on chromosome 1 is 1.37cM away from an SNP that had been reported 
to significantly link to alcohol dependence by Prescott et al. (2006). SNP 
tsc0697701 on chromosome 8 is 0.7Mb away from an SNP that significantly 
links to the alcoholism by Hill et al. (2004). SNP tsc0896393 on chromosome 
12 is 1.5Mb away from an SNP that significantly links to ALDXl reported 
by Ma et al. (2005). After adjusting for the Bonferroni correction, three 
SNPs (tsc0515272 on chromosome 3, tsc0029429 on chromosome 9, and tsc 
1750530 on chromosome 16) remain significant. 

5. Discussion. In this article we adopt the framework of the generalized 
linear model and assume that the expected marker allelic frequency is con- 
nected to the linear predictor based on the trait of interest through an arbi- 
trary specified link function. Although we focus on the logistic link, which is 
the canonical link for a binomial random variable, models utilizing other link 
functions can be built with minor modifications of the approach herein. The 
population-based association study is still a popular study design for com- 
mon traits. To prevent spurious association due to a confounding population 
structure, association studies should be performed within a relative homoge- 
neous population. Such a population-based association study is a special case 
of our method in which the p matrix will be an identity matrix for indepen- 
dent subjects. For the stratified population, Lander and Schork (2006) sug- 
gested using "internal controls" to balance the ethnicity between the cases 
and controls in the sample in order to eliminate the confounding effects. Our 
proposed generalized association method uses all available family members 
to provide natural "internal controls." Conneally (2003) pointed out that for 
any choice of study design, whether based on families or population-based, 
a large sample size is needed to detect an associated gene with only a par- 
tial effect on the trait. The quasi-likelihood scoring method fully utilizes the 
correlation information among the sampled individuals. It accommodates 
various data types for genetic association studies including the conventional 
population-based association studies, and those using founder /isolated pop- 
ulations with documented genealogy, or multiple complex pedigrees. Thus, 
this method essentially increases the sample size and becomes more pow- 
erful. On the other hand, when a data set contains samples from multiple 
subpopulations, we propose a solution that combines the Wq statistics from 
each subpopulation to construct a new test statistic Waii- The Waii statis- 
tic is founded to follow an distribution asymptotically with the degrees 
of freedom depending on the number of subpopulations and the number of 
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alleles of the marker being tested. Simulation results confirm that the 
distribution approximates the distribution of Waii well. Simulation results 
also show that our method has comparable power to the FBAT. However, 
our approach is limited to known subpopulations. If unknown subpopula- 
tions exist, it is possible to extend our approach to a mixture population 
with more population parameters to be estimated. 

It is known that pedigree errors can easily arise in the study of large pedi- 
grees and even in the study of small pedigrees. Our GQLS method cannot 
handle this error directly. However, many methods and software are available 
to detect such errors under different study designs [PREST by McPeek and 
Sun (2000); RELATIVE by Goring and Ott (1997); RELPAIR by Epstein, 
Duren and Boehnke (2000)]. When the pedigree errors are found, involved 
individuals could be either removed from the study accordingly, or, the re- 
lationship, that is, the kinship and inbreeding coefficients, among involved 
individuals can be inferred through the genome scan (if genome data are 
available) as a substitute in the p matrix. However, the approximation of 
the distribution to the resulting Wg statistic needs to be further inves- 
tigated. 
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SUPPLEMENTARY MATERIAL 

Mathematical justifications and additional results 

(DOI: 10.1214/11-AOAS465SUPP; .pdf). The supplementary materials of 
the paper are organized as follows. Appendix A provides the theoretical 
justification of the variance-covariance matrix XIq. Appendix B derives the 
explicit form of the Wq statistic for a biallelic marker in a single pedigree 
study design. Appendix C derives the expression of the Wg statistic for 
a multi-allelic marker in a single pedigree study design. In Appendix D ad- 
ditional results of simulation studies and the results of COGA data analysis 
are summarized in tables. 
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